Abstract

Background: A well-known problem in cluster analysis is finding an optimal number of clusters reflecting the inherent 
structure of the data. PFClust is a partitioning-based clustering algorithm capable, unlike many widely-used clustering 
algorithms, of automatically proposing an optimal number of clusters for the data. 

Results: Tests on various types of data showed that PFClust can discover clusters of arbitrary shapes, sizes 
and densities. The previous implementation of the algorithm had already been successfully used to cluster large 
macromolecular structures and small druglike compounds. We have greatly improved the algorithm with a more efficient 
implementation, which enables PFClust to process large data sets acceptably fast. 

Conclusions: In this paper we present a new optimized implementation of the PFClust algorithm that runs 
considerably faster than the original. 
Introduction 

Cluster analysis [1] comprises methods designed to find 
structure in a dataset. Data can be divided into clusters 
that help us understand the problem domain, inform on- 
going investigation, or form input for other data analysis 
techniques. Clustering methods [2-7] attempt to find such 
clusters based only on the known relationships between 
the data objects. This distinguishes them from supervised 
data analysis approaches, such as classification methods 
[8] that are provided with right and wrong answers to 
guide their data analysis. One of the main challenges in- 
troduced by the lack of class labels is determining an opti- 
mal number of clusters that reflect the inherent structure 
present in the data. Exhaustive cluster enumeration be- 
comes impractical as the size and dimensionality of the 
data grow. We have developed a novel clustering technique 
called PFClust [9] that automatically discovers an 
optimum partitioning of the data without requiring prior 
knowledge of the number of clusters. PFClust is also 
immune to the enumeration problem introduced by 
high-dimensional data, since it relies on a similarity matrix. 
Here we give a brief overview of the algorithm and its 
applications, and present a new efficient implementation. 

PFClust 

PFClust is based on the idea that each cluster can be 
represented as a non-predetermined distribution of the 
intra-cluster similarities of its members. The algorithm partitions 
a dataset into clusters that share some common attributes, 
such as their minimum expectation value and variance of 
intra-cluster similarity. It is an agglomerative algorithm, 
starting with separated objects and progressively joining 
them together to form clusters. The algorithm attempts 
clustering using 20 threshold values, chosen using a ran- 
dom sampling technique, and then uses the Silhouette 
width to select which of the clusterings best describes the 
input dataset. 
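
The selection step can be illustrated with a minimal sketch, given here under stated assumptions rather than as the PFClust code itself. It assumes similarities in [0, 1] that are converted to dissimilarities as 1 - s, cluster labels numbered 0..k-1, and a silhouette of 0 for singletons; the class and method names (SilhouetteSketch, averageSilhouette, bestClustering) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class SilhouetteSketch {
    /**
     * Average silhouette width of one clustering.
     * Assumption: dissimilarity is taken as 1 - similarity; labels are 0..k-1.
     */
    static double averageSilhouette(double[][] sim, int[] labels) {
        int n = labels.length;
        int k = 0;
        for (int label : labels) k = Math.max(k, label + 1);
        int[] size = new int[k];
        for (int label : labels) size[label]++;

        double total = 0.0;
        double[] sumDist = new double[k];
        for (int i = 0; i < n; i++) {
            // Sum of dissimilarities from object i to the members of each cluster.
            Arrays.fill(sumDist, 0.0);
            for (int j = 0; j < n; j++) {
                if (j != i) sumDist[labels[j]] += 1.0 - sim[i][j];
            }
            int own = labels[i];
            if (size[own] == 1) continue;            // singleton: contributes 0
            double a = sumDist[own] / (size[own] - 1);
            double b = Double.POSITIVE_INFINITY;     // nearest other cluster
            for (int c = 0; c < k; c++) {
                if (c != own && size[c] > 0) b = Math.min(b, sumDist[c] / size[c]);
            }
            if (Double.isInfinite(b)) continue;      // single cluster: contributes 0
            double max = Math.max(a, b);
            total += (max == 0.0) ? 0.0 : (b - a) / max;
        }
        return total / n;
    }

    /** Index of the candidate clustering with the largest average silhouette width. */
    static int bestClustering(double[][] sim, List<int[]> candidates) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < candidates.size(); c++) {
            double score = averageSilhouette(sim, candidates.get(c));
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best;
    }
}
```

In this sketch each of the candidate clusterings (one per threshold) would be passed to bestClustering, and the returned index identifies the clustering that is kept.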

Method 

PFClust consists of two steps: threshold estimation and 
clustering. The threshold estimation procedure randomly 
splits the given dataset into clusters 1000 times and re- 
cords the expectation value of the intra-cluster similarities 
between members of the same cluster. Twenty threshold 
values from the top 5% of the distribution of mean intra- 
cluster similarities are selected as a representative range of 
possible thresholds, and are fed into the subsequent 
clustering procedure. The clustering step of the algorithm 
is computationally more intensive than the threshold 
estimation step, and its complexity is O(kn^2), where k is the 
number of clusters and n is the number of elements in the 
dataset. This is also the overall complexity of the PFClust 
algorithm. 
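
The threshold estimation step described above can be sketched as follows. This is an illustration under assumptions, not the published implementation: the text does not specify how the random splits are drawn or how the 20 values are picked from the top 5%, so the sketch assumes uniform random assignment to a fixed number of clusters (numClusters) and evenly spaced picks from the sorted top 5%. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

final class ThresholdSketch {
    /** Mean similarity over all pairs sharing a label; NaN if there is no such pair. */
    static double meanIntraClusterSimilarity(double[][] sim, int[] labels) {
        double sum = 0.0;
        long pairs = 0;
        for (int i = 0; i < labels.length; i++) {
            for (int j = i + 1; j < labels.length; j++) {
                if (labels[i] == labels[j]) {
                    sum += sim[i][j];
                    pairs++;
                }
            }
        }
        return pairs == 0 ? Double.NaN : sum / pairs;
    }

    /**
     * Assumed procedure: draw `trials` random partitions, record the mean
     * intra-cluster similarity of each, sort, and pick `numThresholds`
     * evenly spaced values from the top 5% of that distribution.
     */
    static double[] estimateThresholds(double[][] sim, int numClusters,
                                       int trials, int numThresholds, long seed) {
        Random rng = new Random(seed);
        int n = sim.length;
        int[] labels = new int[n];
        double[] means = new double[trials];
        int kept = 0;
        for (int t = 0; t < trials; t++) {
            for (int i = 0; i < n; i++) labels[i] = rng.nextInt(numClusters);
            double m = meanIntraClusterSimilarity(sim, labels);
            if (!Double.isNaN(m)) means[kept++] = m;
        }
        if (kept == 0) throw new IllegalStateException("no usable random partition");

        double[] sorted = Arrays.copyOf(means, kept);
        Arrays.sort(sorted);                          // ascending order
        int from = (int) Math.floor(0.95 * kept);     // start of the top 5%
        int topCount = kept - from;
        double[] thresholds = new double[numThresholds];
        for (int q = 0; q < numThresholds; q++) {
            int offset = (int) ((long) q * (topCount - 1) / Math.max(1, numThresholds - 1));
            thresholds[q] = sorted[from + offset];
        }
        return thresholds;
    }
}
```

With the values quoted in the text, a call would look like estimateThresholds(sim, numClusters, 1000, 20, seed); fixing the seed makes the thresholds reproducible across runs.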

Without changing the computational complexity of 
the algorithm we have re-implemented it in the same 
programming language as the original (Java) with a care- 
ful selection of data types and appropriate bookkeeping, 
as the majority of operations are performed inside loops 
and involve intensive data structure manipulation. We 
have also taken advantage of the independence of the 20 
iterations of the clustering procedure and executed them 
in parallel. 
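
Because the clustering runs for different thresholds do not depend on one another, they can be submitted to a thread pool. The sketch below shows one way this might look with java.util.concurrent (lambda syntax, so Java 8 or later); the ClusteringRun interface is a hypothetical stand-in for a single clustering run and is not part of PFClust.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelClusteringSketch {
    /** Hypothetical stand-in for one clustering run at a given threshold. */
    interface ClusteringRun {
        int[] cluster(double[][] sim, double threshold);
    }

    /** Runs one clustering per threshold in parallel; results keep threshold order. */
    static List<int[]> runAll(double[][] sim, double[] thresholds,
                              ClusteringRun run, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<int[]>> futures = new ArrayList<>();
            for (double threshold : thresholds) {
                // Each task only reads the shared similarity matrix.
                futures.add(pool.submit(() -> run.cluster(sim, threshold)));
            }
            List<int[]> results = new ArrayList<>();
            for (Future<int[]> future : futures) {
                results.add(future.get());   // blocks until that run has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("clustering interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("clustering run failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The design relies on the runs sharing only read-only data, so no locking is needed around the similarity matrix.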



Performance evaluation 

We compare the performance of the original [9] and 
new PFClust implementations by measuring execution 
time on the following configuration: 

• Hardware: 2.2 GHz Intel(R) Core(TM) i5-3470S 
CPU @ 2.90 GHz, 8.00 GB RAM 

• Operating system: Scientific Linux release 6.3 
(Carbon) 

• JVM: 1.6.0_45-b06 

The running times of each step of the PFClust algorithm 
(threshold estimation, clustering, and the main iteration 
that combines these steps) have been evaluated 
separately. Each step and the main iteration were 
executed 10 times, and average run times were obtained. 
The first step of the algorithm, involving random number 
generation, was initialized with the same seed in both 
implementations to keep the number of calculations 
approximately constant. The clustering was executed with the 
same set of threshold values for each dataset in both 
programs. The main iteration carried out the randomization 
step with the same seed and the clustering procedure with 
the same thresholds. 

Figure 1 Execution times. Comparison of the execution times between the original (black, top row) and new (grey, bottom row) 
implementations, averaged over the seven datasets from [9]. The different steps of the algorithm (Randomization, Clustering and Total Execution 
time) are shown from left to right. The combined process of randomization and clustering has to be run four times (or occasionally more [9]); the 
totals given here include these repetitions. 

The performance improvement of the new implementa- 
tion is primarily due to the representation of the similarity 
matrix and cluster objects. The old implementation used 
string objects as row and column names and looked up 
values in the similarity matrix based on these names. The 
names were stored in a vector, and searching for an element 
in a vector data type is O(n), where n is the number of 
elements. Many operations involved two nested loops to 
search for the corresponding row and column names, 
which resulted in O(n^2) behaviour. The cluster objects in 
the old implementation were also backed by vectors of 
strings and involved intensive computations. There was an 
additional performance overhead related to synchronization 
of vectors, producing an overall performance bottleneck. 
The new implementation utilizes a two dimensional array 
of primitives to represent the similarity matrix and an 
ArrayList data type to represent cluster objects. The values 
are retrieved from the array or ArrayList based on the 
index, a constant time operation. Unlike the old implemen- 
tation, the new code utilizes bookkeeping with HashSet and 
ArrayList data types, where applicable, to decrease the 
number of operations inside the loops. In the threshold es- 
timation step, the data are now sorted before retrieving the 
required values from the array, whereas the values were se- 
lected in a brute-force fashion in the old implementation. 
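
The difference between the two access patterns can be illustrated with a small sketch. It is not taken from either implementation; it only contrasts a name-based lookup through a java.util.Vector of strings (a linear scan per name) with a direct index into a two-dimensional array of primitives, and all class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Vector;

final class SimilarityLookupSketch {
    /** Name-based access: each indexOf is a linear scan over a synchronized Vector. */
    static double byName(Vector<String> names, double[][] sim, String row, String col) {
        int i = names.indexOf(row);   // O(n)
        int j = names.indexOf(col);   // O(n)
        if (i < 0 || j < 0) throw new IllegalArgumentException("unknown object name");
        return sim[i][j];
    }

    /** Index-based access into a primitive 2D array: constant time. */
    static double byIndex(double[][] sim, int i, int j) {
        return sim[i][j];
    }

    /** Resolves names to integer indices once, so that inner loops can use byIndex. */
    static Map<String, Integer> indexNames(List<String> names) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < names.size(); i++) {
            index.put(names.get(i), i);
        }
        return index;
    }
}
```

Resolving each name to an integer once, outside the loops, is the kind of bookkeeping that lets the inner loops work on indices only.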

The evaluation results (Figure 1) show that the execu- 
tion times are greatly improved. The clusterings result- 
ing from the two implementations agree closely, with a 
very high average Rand Index [10] of 0.985 over the 
seven datasets from [9]. 
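
For reference, the Rand Index between two clusterings of the same objects is the fraction of object pairs on which they agree, that is, pairs placed together in both or apart in both. A minimal sketch of that standard definition follows; the class name is hypothetical and this is not the evaluation code used for the figures above.

```java
final class RandIndexSketch {
    /** Fraction of object pairs on which two labelings agree (together or apart). */
    static double randIndex(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("labelings must have the same length");
        }
        int n = a.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two objects");
        }
        long agree = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                boolean togetherInA = a[i] == a[j];
                boolean togetherInB = b[i] == b[j];
                if (togetherInA == togetherInB) agree++;
            }
        }
        long pairs = (long) n * (n - 1) / 2;
        return (double) agree / pairs;
    }
}
```

The value is 1.0 when the two clusterings are identical up to a renaming of the cluster labels.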

Applications 

The previous implementation has already been success- 
fully used for biologically related problems with very 
promising results [9]. A set of protein domains taken from 
CATH [11] were clustered using a spherical polar Fourier 
shape-based representation [12,13]. PFClust proposed 11 
protein families and one singleton domain, whereas CATH 
clusters them into 11 families. While CATH superfamilies 
are based on protein structures that share a common fold, 
structures in the same superfamily might differ consider- 
ably [13]. Hence, approaches like PFClust could be used to 
refine the current families and to identify interesting out- 
liers or problematic cases. 



PFClust has also been successfully used to cluster a large 
number of small molecular structures [14]. ChEMBL [15] 
holds information on over 1,000,000 compounds and 
groups them into families according to their experimental 
bioactivities. These families were individually clustered 
using PFClust to create "refined" families which signifi- 
cantly improved the precision of our protein target 
predictions. 

Conclusion 

An efficient implementation of PFClust enabled us to run 
the program on all our synthetic datasets [9] acceptably 
fast. It processes the largest data set (5000 2D vectors) in 
minutes, while the original implementation took several 
days. This new implementation can now be used effectively, 
not only for small datasets (< 1500) as previously shown, 
but also for larger ones (> 5000). 